Add Cortex-M as a first-class target in aot_arm_compiler #17075
psiddh wants to merge 2 commits into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17075
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures, 3 Cancelled Jobs, 2 Unrelated Failures as of commit e6fd05b with merge base f06a1f6.
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 39666cd to 7f14a9d.
Force-pushed from 1b64ef3 to 41462be.
```python
)

# Cortex-m ops are never included in vgf or direct-drive
if args.target != "vgf" and not args.direct_drive:
```
Should TOSA targets even have a CortexM fallback? (--target=u55/u85 → TOSA delegation)
Pull request overview
This PR enables full MobileNetV2 lowering to the CMSIS-NN backend for Cortex-M microcontrollers by implementing comprehensive support for quantized operations through a dedicated compilation path. The changes replace the previous delegation-based approach with a portable kernel-based architecture that converts all quantized operations to cortex_m::* operators.
Changes:
- Added a dedicated Cortex-M compilation path (to_edge_cortex_m) in the AOT compiler with CortexMQuantizer-based quantization
- Implemented addmm operator support for decomposed linear layers through the new _get_addmm_replacement method
- Enhanced quantization parameter propagation with the new PropagateQParamsPass and passthrough op handling in FoldAndAnnotateQParamsPass
- Extended the quantizer to mark parameter nodes as annotated and added passthrough ops (hardtanh, max_pool2d, dropout)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| examples/arm/aot_arm_compiler.py | Adds to_edge_cortex_m function for Cortex-M compilation path using CortexMQuantizer and removes old transform_for_cortex_m_backend function |
| backends/cortex_m/quantizer/quantizer.py | Adds _mark_param_node_as_annotated method and extends passthrough ops list for MobileNetV2 support |
| backends/cortex_m/passes/propagate_qparams_pass.py | New pass to propagate qparams through passthrough ops (transpose/permute) to consumer nodes like addmm |
| backends/cortex_m/passes/cortex_m_pass_manager.py | Adds PropagateQParamsPass and DecomposeAdaptiveAvgPool2dPass to pass list, adds skip_passes parameter to __init__ |
| backends/cortex_m/passes/convert_to_cortex_m_pass.py | Implements _get_addmm_replacement method to convert decomposed linear (addmm) operations to cortex_m.quantized_linear |
| backends/arm/_passes/fold_qdq_with_annotated_qparams_pass.py | Adds passthrough ops (hardtanh, relu, clamp) support and second-pass qparams propagation logic |
Force-pushed from d7d85fb to b222911.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
```python
def _mark_param_node_as_annotated(self, node: Node) -> None:
    """
    Mark a weight/bias parameter node as annotated.

    This is necessary for FoldAndAnnotateQParamsPass to recognize the node
    as part of a quantized computation path. The ARM quantizer does this
    via mark_annotated=True in _QuantProperty.
    """
    if Q_ANNOTATION_KEY not in node.meta:
        node.meta[Q_ANNOTATION_KEY] = QuantizationAnnotation()
    node.meta[Q_ANNOTATION_KEY]._annotated = True
    annotation_info = ArmAnnotationInfo(quantized=True)
    meta_custom = node.meta.get("custom", {})
    meta_custom[ArmAnnotationInfo.CUSTOM_META_KEY] = dict(annotation_info)
    node.meta["custom"] = meta_custom
```
The implementation of _mark_param_node_as_annotated duplicates the exact logic from mark_node_as_annotated in backends/arm/quantizer/arm_quantizer_utils.py. Consider importing and reusing the existing function instead of duplicating the code to improve maintainability and reduce the risk of divergence.
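A minimal sketch of that suggestion, assuming mark_node_as_annotated is importable from the module the comment names and accepts a single node:

```python
# Sketch only: reuse the existing ARM helper rather than duplicating it.
# Assumes the import path below matches backends/arm/quantizer/arm_quantizer_utils.py.
from executorch.backends.arm.quantizer.arm_quantizer_utils import (
    mark_node_as_annotated,
)
from torch.fx import Node


def _mark_param_node_as_annotated(self, node: Node) -> None:
    """Delegate to the shared ARM helper instead of re-implementing it."""
    mark_node_as_annotated(node)
```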
Hi, this PR needs major changes I'm afraid.
- The changes to fold_qdq_with_annotated_qparams_pass and propagate_qparams_pass are very likely not needed; rather, they are masking a faulty implementation of either the add_mm or its integration in the aot_arm_compiler.
- The addition of the add_mm is a significant change which should be made in a separate PR, properly tested with unittests as is done for all other ops.
- It would be great to add mv2 as a pytest similar to mv3; in fact, I would suggest starting by getting that working before adding support to the aot_arm_compiler, since the compilation pipeline is guaranteed to be working there.
Sure - I agree with the approach; I just wanted to share the work I've been up to recently.
Context on the design choice: the Cortex-M backend keeps addmm directly (vs ARM's decomposition to Conv2D) to leverage CMSIS-NN's optimized linear kernels. When PyTorch decomposes nn.Linear to edge dialect, the weight flows through a transpose before reaching addmm. FoldAndAnnotateQParamsPass folds the DQ into the permute, but output_qparams remains empty (no Q node after the permute).
Proposed approach: this way we have proper test coverage before discussing the implementation details. Let me get the unit tests in place.
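To make the failure mode concrete, here is the graph shape being described, with illustrative node names (not the PR's actual graph dump):

```python
# Edge-dialect pattern produced when nn.Linear is decomposed (illustrative):
#
#   weight_int8 -> dequantize_per_tensor -> permute_copy -> addmm
#
# FoldAndAnnotateQParamsPass folds the DQ into permute_copy's input_qparams,
# but no Q node follows permute_copy, so its output_qparams stays empty and
# the addmm replacement cannot recover the weight's scale/zero-point.
```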
Sounds good!
I think the issue here is that you are not using the EdgeCompileConfig used in the tester. When linear is not decomposed, you avoid the issues around q/dq folding. In general, the design philosophy is that we want the decompositions and annotations to produce correct q/dq values directly, rather than handling special cases in the folding, as that gets complex very quickly from our previous experience in the arm backend.
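For reference, a minimal sketch of the kind of config being pointed at; the exact contents of preserve_ops used by the tester are an assumption here:

```python
import torch
from executorch.exir import EdgeCompileConfig

# Keep linear intact through to_edge so it is not decomposed into
# transpose + addmm, sidestepping the q/dq folding special cases above.
edge_compile_config = EdgeCompileConfig(
    preserve_ops=[torch.ops.aten.linear.default],  # assumed op list
)
```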
Previously, Cortex-M op conversion was applied as an afterthought to all
non-vgf targets via transform_for_cortex_m_backend(). This made the flow
hard to follow, used a bare EdgeCompileConfig that decomposed ops like
linear into addmm (requiring unnecessary workarounds), and didn't use the
CortexMQuantizer or CortexMPassManager.
Add a dedicated to_edge_cortex_m() path selected via --target=cortex-m that
owns the full pipeline: CortexMQuantizer for INT8 quantization, correct
EdgeCompileConfig with preserve_ops to prevent premature decomposition, and
CortexMPassManager.pass_list for op conversion. Remove the old scattered
transform_for_cortex_m_backend() function.
Verified all ops fully lowered to cortex_m::quantized_* operators for both
MobileNetV2 (70 nodes) and MobileNetV3 (122 nodes). E2E inference tested
on Alif E8 board.
Test Plan:
python3 -m examples.arm.aot_arm_compiler -m mv2 --target=cortex-m --quantize --intermediates=./mv2_intermediates --output=./mv2_cortex_m.pte
python3 -m examples.arm.aot_arm_compiler -m mv3 --target=cortex-m --quantize --intermediates=./mv3_intermediates --output=./mv3_cortex_m.pte
Also ran E2E inference on Alif E8 board
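For readers following the thread, a condensed sketch of what such a dedicated path can look like. This is not the PR's literal code: the quantizer and pass-manager import paths are inferred from the file summary above, the preserved-op list is an assumption, and instantiate_cortex_m_passes is a hypothetical helper (the actual pass instantiation is quoted verbatim later in this thread):

```python
import torch
from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
from executorch.exir import EdgeCompileConfig, to_edge
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e


def to_edge_cortex_m(model: torch.nn.Module, example_inputs: tuple):
    # 1. PT2E INT8 quantization with the Cortex-M quantizer.
    graph = torch.export.export(model, example_inputs).module()
    prepared = prepare_pt2e(graph, CortexMQuantizer())
    prepared(*example_inputs)  # calibration run
    converted = convert_pt2e(prepared)

    # 2. Lower to edge dialect, preserving ops (e.g. linear) so they are
    #    not prematurely decomposed into transpose + addmm.
    edge = to_edge(
        torch.export.export(converted, example_inputs),
        compile_config=EdgeCompileConfig(
            preserve_ops=[torch.ops.aten.linear.default],  # assumed list
        ),
    )

    # 3. Convert quantized patterns to cortex_m::* operators using the
    #    Cortex-M pass list (hypothetical helper; see the instantiation
    #    snippet quoted below).
    return edge.transform(instantiate_cortex_m_passes(edge))
```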
Pull request overview
Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.
```python
pass_instances = []
for pass_cls in CortexMPassManager.pass_list:
    sig = inspect.signature(pass_cls.__init__)
    if "exported_program" in sig.parameters:
        pass_instances.append(pass_cls(edge.exported_program()))
    else:
        pass_instances.append(pass_cls())
edge = edge.transform(pass_instances)
```
Manual pass instantiation duplicates logic from CortexMPassManager.transform(). The code here manually inspects each pass class and instantiates it based on whether it accepts an exported_program parameter, which duplicates the exact same logic already present in CortexMPassManager.transform(). Consider simplifying this by using the CortexMPassManager directly instead of manually instantiating passes. For example: `pass_manager = CortexMPassManager(edge.exported_program()); edge_ep = pass_manager.transform(); edge = EdgeProgramManager({"forward": edge_ep}, ...)`
Already tried the CortexMPassManager approach, and it broke things — CortexMPassManager.transform() returns an ExportedProgram, not an EdgeProgramManager. The edge object was left untransformed, resulting in 351 raw aten ops.
The cleanest approach: add an instantiate_passes method to CortexMPassManager that extracts the inspect logic into a reusable method. Then both transform() and to_edge_cortex_m() can use it without duplication; that can be a follow-up PR.
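A sketch of that follow-up shape, reusing the inspect logic from the snippet above; instantiate_passes is the name proposed here, not existing API:

```python
import inspect


class CortexMPassManager:
    pass_list: list = []  # existing Cortex-M pass classes registered here

    @classmethod
    def instantiate_passes(cls, exported_program):
        """Instantiate each pass, handing the exported program only to
        passes whose __init__ declares an exported_program parameter."""
        instances = []
        for pass_cls in cls.pass_list:
            sig = inspect.signature(pass_cls.__init__)
            if "exported_program" in sig.parameters:
                instances.append(pass_cls(exported_program))
            else:
                instances.append(pass_cls())
        return instances
```

Both transform() and to_edge_cortex_m() could then call instantiate_passes() and feed the result to their respective transform entry points.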